Skip to content

[Feature] Support adding an observer role in FE to improve query load balancing and scale read workloads#734

Open
zhaohehuhu wants to merge 12 commits into
StarRocks:mainfrom
zhaohehuhu:dev-0106
Open

[Feature] Support adding an observer role in FE to improve query load balancing and scale read workloads#734
zhaohehuhu wants to merge 12 commits into
StarRocks:mainfrom
zhaohehuhu:dev-0106

Conversation

@zhaohehuhu

@zhaohehuhu zhaohehuhu commented Jan 6, 2026

Copy link
Copy Markdown
Contributor

Description

Support adding a read-only observer role in FE via the K8s Operator to scale read capacity and improve query load balancing.

associated with this PR(StarRocks/starrocks#67778)

Related Issue(s)

No.

Checklist

For operator, please complete the following checklist:

  • run make generate to generate the code.
  • run golangci-lint run to check the code style.
  • run make test to run UT.
  • run make manifests to update the yaml files of CRD.

For helm chart, please complete the following checklist:

  • make sure you have updated the values.yaml
    file of starrocks chart.
  • In scripts directory, run bash create-parent-chart-values.sh to update the values.yaml file of the parent
    chart( kube-starrocks chart).

Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu zhaohehuhu changed the title Support adding observer role Support adding observer role in FE Jan 6, 2026
@zhaohehuhu zhaohehuhu changed the title Support adding observer role in FE Support adding an Observer role in FE via the Operator Jan 6, 2026
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu

zhaohehuhu commented Jan 7, 2026

Copy link
Copy Markdown
Contributor Author

@yandongxiao @kevincai plz help review

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for an Observer role in FE (Frontend) via the Kubernetes Operator to scale read capacity and improve query load balancing. The Observer role is implemented as a configurable field that can be set to specify the number of observer nodes.

Key changes:

  • Added ObserverNumber field to StarRocksFeSpec API with a getter method that defaults to 0
  • Exposed observer number as an environment variable (OBSERVER_NUMBER) to FE pods
  • Updated CRD schemas to include the new observerNumber field
  • Updated Helm chart values.yaml files with commented examples for the new configuration option

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/apis/starrocks/v1/starrockscluster_types.go Adds ObserverNumber field to StarRocksFeSpec and implements GetObserverNumber() method with default value of 0
pkg/apis/starrocks/v1/zz_generated.deepcopy.go Auto-generated deep copy implementation for the new ObserverNumber pointer field
pkg/apis/starrocks/v1/labels.go Adds OBSERVER_NUMBER constant for the environment variable name
pkg/apis/starrocks/v1/load_type.go Minor whitespace cleanup (removed blank lines)
pkg/k8sutils/templates/pod/spec.go Updates Envs function to set OBSERVER_NUMBER environment variable for FE pods using the new field
pkg/k8sutils/templates/pod/spec_test.go Adds test coverage for the OBSERVER_NUMBER environment variable in FE spec
config/crd/bases/starrocks.com_starrocksclusters.yaml Adds observerNumber field definition to the CRD schema
deploy/starrocks.com_starrocksclusters.yaml Adds observerNumber field definition to the deployed CRD schema
helm-charts/charts/kube-starrocks/charts/starrocks/values.yaml Adds commented example for observerNumber configuration
helm-charts/charts/kube-starrocks/values.yaml Adds commented example for observerNumber configuration in parent chart

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/apis/starrocks/v1/starrockscluster_types.go Outdated
Comment thread deploy/starrocks.com_starrocksclusters.yaml Outdated
Comment thread config/crd/bases/starrocks.com_starrocksclusters.yaml Outdated
Comment thread pkg/apis/starrocks/v1/starrockscluster_types.go
Comment thread helm-charts/charts/kube-starrocks/values.yaml Outdated
Comment thread helm-charts/charts/kube-starrocks/charts/starrocks/values.yaml Outdated
@kevincai

kevincai commented Jan 8, 2026

Copy link
Copy Markdown
Collaborator

don't try to use FE_OBSERVER_NUMBER to hack the current fe_entry_point.sh

It is so dangerous that a single improper env var update can ruin the entire cluster.

@zhaohehuhu

Copy link
Copy Markdown
Contributor Author

don't try to use FE_OBSERVER_NUMBER to hack the current fe_entry_point.sh

It is so dangerous that a single improper env var update can ruin the entire cluster.

Got it. In that case, we may need to add a new FE observer component.

Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu zhaohehuhu changed the title Support adding an Observer role in FE via the Operator Support adding an Observer role in FE to improve query load balancing Jan 12, 2026
@zhaohehuhu zhaohehuhu changed the title Support adding an Observer role in FE to improve query load balancing Support adding an Observer role in FE to improve query load balancing and scale read workloads Jan 12, 2026
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@kyle-goodale-klaviyo

Copy link
Copy Markdown

This is much needed, thank you!

@zhaohehuhu

Copy link
Copy Markdown
Contributor Author

This is much needed, thank you!

I still need some time to test this feature. Thanks!

@zhaohehuhu zhaohehuhu changed the title Support adding an Observer role in FE to improve query load balancing and scale read workloads [Feature] Support adding an Observer role in FE to improve query load balancing and scale read workloads Jan 14, 2026
@zhaohehuhu zhaohehuhu changed the title [Feature] Support adding an Observer role in FE to improve query load balancing and scale read workloads [Feature] Support adding an observer role in FE to improve query load balancing and scale read workloads Feb 12, 2026
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu

Copy link
Copy Markdown
Contributor Author

@kevincai @yandongxiao plz help review. Thanks!

Comment thread pkg/subcontrollers/feobserver/feobserver_config.go Outdated
Comment thread helm-charts/charts/kube-starrocks/charts/starrocks/values.yaml Outdated
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu zhaohehuhu requested a review from emanuilov June 1, 2026 09:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • pkg/apis/starrocks/v1/zz_generated.deepcopy.go: Language not supported

kevincai
kevincai previously approved these changes Jun 2, 2026
@yandongxiao

Copy link
Copy Markdown
Collaborator

Thanks for the contribution. The overall mechanism — a dedicated observer workload plus the minimal fe_entrypoint.sh change in StarRocks/starrocks#67778 — looks like the right direction. However, I have four concerns. The first one is blocking.

1. Observer pods have no resolvable FQDN, and no Service routes to them (blocking)

MakeStatefulset initially sets ServiceName to <cluster>-fe-observer-search (the new case in service.SearchServiceName), but SyncCluster immediately overrides it:

expectSts.Spec.ServiceName = feSearchServiceName(src.Name) // "<cluster>-fe-search"

That service does exist (created by the FE controller), but its selector is built from load.Selector(src.Name, feSpec), i.e. {app.starrocks.ownerreference/name: <cluster>-fe, app.kubernetes.io/component: fe}, while observer pods carry {<cluster>-fe-observer, fe-observer}. The selector never matches, so observer pods never appear in the Endpoints of <cluster>-fe-search, and cluster DNS never publishes A records for <pod>.<cluster>-fe-search.<ns>.svc.cluster.local.

The pod can still resolve its own FQDN locally (kubelet writes hostname/subdomain into /etc/hosts), so ALTER SYSTEM ADD OBSERVER "$MYSELF:$EDIT_LOG_PORT" is actually sent — but the FE leader cannot resolve that FQDN, BDBJE cannot connect back, and the entrypoint times out and exits → CrashLoopBackOff. Since the operator always sets HOST_TYPE=FQDN, the feature cannot work as currently implemented.

Additionally, because the dedicated observer services were removed, no Service selects observer pods at all — there is no entry point for clients to reach observers, which seems at odds with the stated goal of query load balancing.

Suggestion: keep a dedicated headless search service for the observer StatefulSet (selector = observer labels), which also makes the currently-dead *StarRocksFeObserverSpec case in service.SearchServiceName meaningful again. It is also worth clarifying how clients are expected to reach observers (e.g. a dedicated external service).

2. deleteLegacyObserverServices should be removed

It deletes <cluster>-fe-observer-search / <cluster>-fe-observer-service, which were only ever created by earlier revisions of this PR and never released — no real deployment can have these resources.

It also runs on the enabled path on every reconcile, and clearObserverResources runs on the disabled path (the default) on every reconcile of every cluster. DeleteService/DeleteStatefulset log "delete service/statefulset from kubernetes" before the existence check, so every reconcile of every cluster emits misleading delete logs plus pointless GETs. Consider gating the disabled-path cleanup on src.Status.StarRocksFeObserverStatus != nil.

(Minor: the !apierrors.IsNotFound(err) checks are redundant — DeleteService/DeleteStatefulset already return nil on NotFound.)

3. No deregistration on scale-down / disable

Neither the operator nor fe_entrypoint.sh ever runs ALTER SYSTEM DROP OBSERVER. Scaling observerNumber down or setting enabled: false leaves permanent Alive=false entries in SHOW FRONTENDS. This matches the existing FE follower convention (manual DROP), but observers are precisely the role intended to be scaled dynamically, so dead entries will accumulate much faster. This should at least be documented as a known limitation, with manual cleanup instructions.

4. Old image + observerSpec.enabled=true silently adds FOLLOWERs (dangerous)

IS_FE_OBSERVER is only understood by fe_entrypoint.sh after StarRocks/starrocks#67778. With any older FE image, the env var is ignored and the script falls through to ALTER SYSTEM ADD FOLLOWER — the observer replicas join as voting followers and silently change the BDBJE quorum. This is exactly the failure mode raised earlier in this thread regarding env-var-driven behavior.

Suggestion: add an operator-side image version guard (IsLowerThanAny is already used this way for the BE preStop path) and document the minimum required FE image version for observerSpec.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no drop observer logic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d prefer to add that logic in a separate PR to keep the scope of this PR small. Would that be okay? @yandongxiao

@zhaohehuhu

Copy link
Copy Markdown
Contributor Author

@yandongxiao Thanks for your comments. I will make some improvements.

Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: zhaohehuhu <luoyedeyi@163.com>
@zhaohehuhu

Copy link
Copy Markdown
Contributor Author

@yandongxiao plz help review. Thanks!

@vikasattiguppa

Copy link
Copy Markdown

@yandongxiao can you please review these changes i'm also interested in getting the capability for our starrocks deployments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants